Abstract

In information retrieval, a fundamental goal 
is to transform a document into concepts 
that are representative of its content. The 
term "representative" is in itself challenging 
to define, and various tasks require differ- 
ent granularities of concepts. In this paper, 
we aim to model concepts that are sparse 
over the vocabulary, and that flexibly adapt
their content based on other relevant se- 
mantic information such as textual structure 
or associated image features. We explore 
a Bayesian nonparametric model based on 
nested beta processes that allows for inferring 
an unknown number of strictly sparse con- 
cepts. The resulting model provides an in- 
herently different representation of concepts 
than standard LDA (or HDP) based topic 
models, and allows for direct incorporation of 
semantic features. We demonstrate the util- 
ity of this representation on multilingual blog 
data and the Congressional Record. 

1 Introduction 

Information overload is a ubiquitous problem that af- 
fects nearly all members of society, from researchers 
sifting through millions of scientific articles to Web 
users trying to gauge public opinion by reading blogs. 
Even as information retrieval (IR) methods evolve to 
move beyond the traditional Web search paradigm to 
more varied retrieval tasks focused on combating this 
overload, they remain reliant on suitable document 
representations. While all representations ultimately 
distill the contents of a document collection into fundamental
concepts representing the atomic units of
information, a particular choice of representation can
have potentially drastic consequences on performance. 

For instance, a common desideratum of retrieval tasks 
is diversity in the result set [5]. Here, the document
representation must be expressive enough such that 
it is recognizable when two documents are about the 
same idea. A fine-grained representation (e.g., indi- 
vidual words or named entities) may lead to many 
concepts with the same meaning. For example, "Presi- 
dent Obama," "Barack Obama," "Mr. President" and 
"POTUS" are all distinct named entities that refer to 
the 44th President of the United States, but may end 
up as distinct concepts depending on the representa- 
tion. On the other end of the spectrum, coarse-grained 
representations, such as topics from a topic model [3], 
may conflate many ideas that are only vaguely
related. This vagueness is particularly a problem for
systems that attempt to personalize results to users'
individual tastes, and as such need to estimate a user's
level of interest in each concept (e.g., [1, 6]).

In this paper, we seek to learn concept representa- 
tions at an appropriate level of granularity, represent- 
ing each concept as a set of words that are functionally 
equivalent for the particular task at hand. We take a 
cue from the computer vision community [15] and refer 
to such concepts as superwords. We desire the follow- 
ing characteristics from our model: 

1. The number of potential ideas that are to be mod- 
eled as concepts is unknown and unbounded, and 
thus our model should be able to handle this un- 
certainty. 

2. Our model should specify a probabilistic interpre- 
tation for how much each document is about any 
given concept, allowing for seamless incorporation 
into IR systems. 

3. We should be able to easily encode semantic in- 
formation about the vocabulary into our model, 
based on the idea that two words occurring in the 
same concept should share a meaning in some un- 
derlying semantic space. 

4. The same concept can be represented differently 
in different documents (i.e., it can contain slightly
different sets of words). 

Our approach addresses these properties by relying on 
the machinery of Bayesian nonparametric methods. In
particular, we use a nested beta process prior to provide 
strict sparsity in the set of concepts used in a document 
and the set of words associated with each concept, 
while at the same time allowing for uncertainty in the 
number of concepts [10, 11, 16, 18]. Such a prior encourages
sharing of concepts and word choices among documents,
but provides flexibility for documents to differ.
For instance, Democrats may say "healthcare reform"
while Republicans may opt to say "Obamacare," but 
both are referring to the same concept. Previous mod- 
els assume that topics are the same for each document, 
and so to model such a phenomenon, they either cre- 
ate multiple topics to refer to the same idea, or else a 
single conflated topic with probability mass on words 
informed by both populations. We avoid both unde- 
sirable options by using a more expressive prior. 

Moreover, as described above, there is inherent un- 
certainty in the granularity of the concepts. Specifi- 
cally, does one choose more concepts, each with fewer 
words, or fewer concepts, each with more words? This 
is a question of identifiability in the nested beta pro- 
cess. Thus, to further inform our sought-after sparsity 
structure, we augment the prior with a semantic fea- 
ture matrix in which each word in the vocabulary has 
an associated observed feature vector. This matrix has 
the additional benefit of allowing us to fuse multiple 
sources of information. For example, the feature vec- 
tor may capture sentence co-occurrence of words in the 
vocabulary. Such information harnesses the structure 
of the text lost in the simple bag-of-words formula- 
tion. Alternatively (or additionally), this feature vec- 
tor can include non-textual information, such as fea- 
tures of images of each word, or features learned from 
user feedback that can be useful for personalization. 
The semantic feature matrix is modeled as a weighted 
combination of latent concept semantic features, where 
the weighting is based on word assignments to each 
concept across the corpus, thereby encoding semantic 
similarity into concept membership. 

The model we propose in this paper is fundamentally different from topic models and other generative models of text in that we represent each concept as a sparse set of words, rather than as topics that are distributions over an entire vocabulary. As such, our concepts can be more coherent and focused than traditional topics. For example, the top row of Figure 1 compares a concept learned via our model with a corresponding topic from a hierarchical Dirichlet process (HDP) [17], both learned on the same Congressional Record corpus (cf. Sec. 5). We see in (a) that our model is able to focus on the salient idea of the word "Obamacare," in that it is a pejorative term used by Republicans to describe President Obama's health care reform package. The HDP topic in (b) provides a much vaguer and more diffuse representation. Additionally, following Williamson et al. [21], our prior allows us to decouple the prevalence of a concept from its strength. In other words, in our model, it is possible for a rare concept to be highly important to the documents that contain it, which is a characteristic difficult to obtain in traditional topic modeling approaches.

Figure 1: The top row contains word clouds comparing a concept learned using our model (a) with a topic from an HDP (b), both involving the word "Obamacare." The bottom row has the same concept and topic, but as displayed to users in our user study, described in Sec. 5.

To recap, the main contributions of this paper are: 

• A novel use of a nested beta process prior to define 
concepts in a document as sparse sets of words;

• The ability to elegantly incorporate semantic side 
information to guide concept formation; 

• An MCMC inference procedure for learning con- 
cepts from data; and 

• Empirical results, including a user study, on both
multilingual blog data and the Congressional 
Record, showing the efficacy of our approach. 

2 Nested Beta Processes 

We wish to model the situation where there is an unknown (and unbounded) collection of concepts in the world, and each document is about some sparse subset of them. A natural way to model this uncertainty and unboundedness in a probabilistic manner is to look to Bayesian nonparametric methods. For instance, Dirichlet processes (DPs) have long been used as priors for mixture models in which the number of mixture components is unknown [2]. However, in our case, rather than assigning each observation to a single cluster as done in DPs, we require a featural model, where each document can be made up of several concepts. As such, we base our model on a nested version of the beta process [10], described in this section.

^Throughout this paper, word clouds are used to illustrate topics and concepts, where the size of a word is proportional to its weight or prevalence in the topic or concept.

2.1 Beta Process - Bernoulli Process 

We consider the situation where there is a countably infinite number of concepts in the world, each represented by a coin with a particular bias, $\omega_j$, and a set of attributes that define the concept, $\psi_j$. A process for assigning concepts to a particular document could thus be to flip each of the coins, and if coin j lands heads, we assign concept j to the document. This process of flipping coins to assign concepts to each document is known formally as the Bernoulli process, since we have a $c_j^{(d)} \sim \mathrm{Bernoulli}(\omega_j)$ draw for each concept, where $c_j^{(d)} = 1$ indicates that concept j is on for document d.

As we do not know the values of the coin biases $\omega_j$ and concept attributes $\psi_j$, we wish to place a prior over them that encodes our desire for a sparse set of active concepts per document. Specifically, if we let

$$B = \sum_{j=1}^{\infty} \omega_j \delta_{\psi_j}, \qquad (1)$$

we exploit the fact that the Bernoulli process has a conjugate prior known as the beta process, and write $B \sim \mathrm{BP}(b, B_0)$ to indicate that B is distributed according to a beta process with concentration parameter $b > 0$ and a base measure $B_0$ over some measurable space $\Psi$. By construction, the biases $\omega_j$ lie in the interval (0,1), and thus if the mass of the base measure, $\alpha_\omega = B_0(\Psi)$, is finite, then B has finite expected measure and we obtain our desired concept sparsity.

Formally, the beta process is defined as a realization of a nonhomogeneous Poisson process with rate measure defined as the product of the base measure $B_0$ and an improper beta distribution. In the special case where the base measure contains discrete atoms i, with associated measure $\lambda_i$, then a sample $B \sim \mathrm{BP}(b, B_0)$ necessarily contains the atom, with associated weight $\omega_i \sim \mathrm{Beta}(b\lambda_i, b(1 - \lambda_i))$ (cf. [18]).
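As a rough illustration of the discrete-atom case, the sketch below draws one weight per atom from Beta(b λ_i, b(1 − λ_i)) and then flips the corresponding coins for a single document. The function names and the use of Python's `random` module are assumptions made for this example; it also assumes every λ_i lies strictly between 0 and 1 so the beta parameters are positive.

```python
import random


def sample_discrete_beta_process(lambdas, b=1.0, seed=0):
    """Draw one weight per atom: omega_i ~ Beta(b * lambda_i, b * (1 - lambda_i)).

    Assumes 0 < lambda_i < 1 for every atom (hypothetical helper, not from the paper).
    """
    rng = random.Random(seed)
    return [rng.betavariate(b * lam, b * (1.0 - lam)) for lam in lambdas]


def bernoulli_process(weights, rng):
    """Flip one coin per atom; a 1 means the atom is active for this document."""
    return [1 if rng.random() < w else 0 for w in weights]


if __name__ == "__main__":
    omegas = sample_discrete_beta_process([0.1, 0.5, 0.9], b=2.0, seed=3)
    print(omegas)
    print(bernoulli_process(omegas, random.Random(4)))
```

Atoms with larger base mass λ_i tend to receive larger weights, so they are switched on in more documents.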

2.2 Indian Buffet Process 

A Bernoulli process realization $c^{(d)}$ from our prior determines the subset of concepts that are active for document d. As in Thibaux and Jordan [18], due to conjugacy, we can analytically marginalize the beta process measure B and obtain a predictive distribution simply over the concept assignments $c^{(d)}$. Taking the concentration parameter to be b = 1 yields the Indian buffet process (IBP) of Griffiths and Ghahramani [9].

^It is important to note that, unlike a Dirichlet process base measure, $B_0(\Psi)$ need not be equal to 1. The beta process is a completely random measure [12], where realizations on disjoint sets are independent random variables.

The IBP is a culinary metaphor that describes how 
the sparsity structure is shared across different draws 
of the Bernoulli process. Each document is represented 
as a customer in an Indian buffet with infinitely many 
dishes, where each dish represents a concept. The first 
customer (document) samples a Poisson($\alpha_\omega$) number
of dishes. The dth customer selects a previously tasted
dish j with probability $m_j/d$, where $m_j$ is the number
of customers to previously sample dish j. The customer then
chooses a Poisson($\alpha_\omega/d$) number of new dishes. With
this metaphor, it is easy to see that the sparsity pat- 
tern over concepts is shared across documents because 
a document (customer) is more likely to pick a concept 
(dish) if many previous documents have selected it. 
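The buffet metaphor translates directly into a sequential sampler. The following is a minimal sketch, assuming a mass parameter `alpha` and a simple Knuth-style Poisson sampler (both names are choices made for this example, not the authors' implementation).

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_docs, alpha, seed=0):
    """Return one set of active concept indices per document, drawn from an IBP."""
    rng = random.Random(seed)
    dish_counts = []  # dish_counts[j] = number of documents that picked concept j so far
    assignments = []
    for d in range(1, num_docs + 1):
        chosen = set()
        # Revisit each previously tasted dish with probability m_j / d.
        for j, m_j in enumerate(dish_counts):
            if rng.random() < m_j / d:
                chosen.add(j)
        # Then try Poisson(alpha / d) brand-new dishes.
        for _ in range(sample_poisson(alpha / d, rng)):
            dish_counts.append(0)
            chosen.add(len(dish_counts) - 1)
        for j in chosen:
            dish_counts[j] += 1
        assignments.append(chosen)
    return assignments


if __name__ == "__main__":
    for d, concepts in enumerate(sample_ibp(5, alpha=2.0, seed=1), start=1):
        print(d, sorted(concepts))
```

Popular dishes accumulate counts and are revisited more often, which is the sharing of sparsity patterns across documents described above.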

2.3 Nested Beta Process 

Above, we described how a beta process prior is well- 
suited to modeling the presence or absence of concepts 
in each document d of a document collection. However, 
we are still left with the task of modeling the presence 
or absence of words in a particular concept. Just as 
was the case at the concept level, a document should be 
more likely to activate a word i in a concept j if many
other documents also have word i active in concept j. 

This leads to a natural extension of the IBP culinary metaphor to include condiments that are added alongside each dish. Specifically, after the dth customer (document) selects her dishes (concepts) from the Indian buffet, she selects an assortment of chutneys (words) to accompany each dish. Analogous to what happens at the concept level, if a customer is the first to sample from dish j, then she selects Poisson($\alpha_\gamma$) types of chutney. The dth customer to sample from dish j selects chutney i with probability $m_{ji}/d$, where $m_{ji}$ is the number of previous customers who sampled chutney i alongside dish j. She then selects Poisson($\alpha_\gamma/d$) types of new chutney.

Formally, this results in a nested beta process (nBP),

$$B \sim \mathrm{BP}(b_1, \mathrm{BP}(b_2, B_0)), \qquad (2)$$

which can equivalently be described as

$$B = \sum_{j=1}^{\infty} \omega_j \delta_{B_j^*}, \qquad B_j^* = \sum_{i=1}^{\infty} \gamma_{ji} \delta_{\psi_{ji}}. \qquad (3)$$

That is, a draw from an nBP is a discrete measure whose atoms are themselves discrete measures. Just as Bernoulli process draws from the top-level beta process result in concept assignments $c^{(d)}$, we draw $f_{ji}^{(d)} \sim \mathrm{Bernoulli}(\gamma_{ji})$ from the lower level beta process $B_j^*$, where $f_{ji}^{(d)}$ indicates whether word i is active in concept j for document d. A related idea of nesting beta processes is discussed in pT| .

Given: $Q = \sum_i \lambda_i \delta_{q_i}$, where $\lambda_i \in [0, 1]$ and $q_i$ come from our vocabulary.

for all documents d = 1, ..., D do
    for all concepts j = 1, 2, ... do
        $\pi_j^{(d)} \sim \mathrm{Gamma}(\alpha_\pi, 1)$
        $c_j^{(d)} \sim \mathrm{Bernoulli}(\omega_j)$
        for all i = 1, ..., V do
            $f_{ji}^{(d)} \sim \mathrm{Bernoulli}(\gamma_{ji})$
    for all words n do
        $z_n^{(d)} \sim$ multinomial with weights $\propto \pi^{(d)} \odot c^{(d)}$
        $w_n^{(d)} \sim$ multinomial with weights $\propto \theta \odot f_{z_n}^{(d)}$
for all semantic features k do
    $\sigma_k^2 \sim \mathrm{InvGamma}(\alpha_\sigma, \beta_\sigma)$
    $X_{kj} \sim N(0, \sigma_k^2/\kappa)$, where $\kappa$ is a positive scalar

Figure 2: Plate diagram and generative model. We write $(\omega, \gamma) \sim \mathrm{nBP}(\alpha_\omega, Q)$ for the distribution over the coin biases.

It is important to note that, in analogy to hierarchical 
clustering (e.g., the nested Dirichlet process of [H]), 
the nBP is only weakly identifiable in that various 
combinations of dish-chutney probabilities lead to the 
same likelihood of the data. Thus, only the prior spec- 
ification differentiates these entities (e.g., encouraging 
fewer concepts each with more words or more concepts 
each with fewer words.) Hierarchical clustering mod- 
els often propose identifiability constraints such as the 
fact that components that correspond to the same class 
should be closer to each other than to components corresponding
to other classes. In our model of Sec. 3, we
avoid such explicit constraints and instead incorporate 
global information which reduces the sensitivity of the 
model to nBP hyperparameter settings. We do so in 
a way that maintains exchangeability (and thus com- 
putational tractability) of the model. 

3 Capturing Superwords with Nested 
Beta Processes 

Given a document collection of D documents and a vocabulary of V words, we wish to model the concepts therein as superwords. In particular, this entails identifying which concepts j are present in each document d (indicated by $c_j^{(d)} = 1$), along with which words i are active in concept j for document d (indicated by $f_{ji}^{(d)} = 1$). These binary variables are the nested beta process features described in Sec. 2, with c representing the chosen dishes and f the chosen chutneys. In particular, we assume a prior $B \sim \mathrm{BP}(1, \mathrm{BP}(1, Q))$, with a discrete base measure $Q = \sum_{i=1}^{V} \lambda_i \delta_{q_i}$, where $\lambda_i \in [0, 1]$ and $q_i$ come from our vocabulary.

Together, c and f define the makeup of the concepts and their presence in each document. However, to tie these superwords to the actual document text, we must specify a generative process for the observed words. First, we model the relative importance of each concept j in a document d as i.i.d. gamma-distributed random variables, $\pi_j^{(d)} \sim \mathrm{Gamma}(\alpha_\pi, 1)$. We then associate the nth word in each document with a concept assignment $z_n^{(d)}$, drawn from a multinomial distribution proportional to $\pi^{(d)} \odot c^{(d)}$ (where $\odot$ refers to the element-wise Hadamard product). As such, $z_n^{(d)}$ can only take values j where $c_j^{(d)} = 1$. Likewise, given an assignment $z_n^{(d)} = j$, a document generates its nth word, $w_n^{(d)}$, from a multinomial distribution proportional to $\theta \odot f_j^{(d)}$. Here, $\theta_i$ is a parameter of our model indicating the relative importance of word i. Our featural model decouples concept presence in a document from its prevalence. Additionally, concepts that select overlapping sets of words need not have the same marginal probability of the shared word(s). A visual depiction of the graphical model as well as a summary of the full generative process is provided in Figure 2.

Our model as presented thus far suffers from the weak identifiability issues described in Sec. 2. In particular, the representational flexibility of the model can lead to many concepts, each containing few words, or a few concepts, each with many words, with little difference in the likelihood of the data between the two cases. We address this problem by incorporating semantic information about our vocabulary in the manner described below (cf. the supplemental material for an illustration of this problem on synthetic data).

^The non-zero elements of the resulting normalized distribution are Dirichlet distributed, with dimensionality determined by the number of ones in c.

^Alternatively, $\theta_i$ can be modeled similarly to $\pi_j^{(d)}$, as gamma-distributed i.i.d. random variables.
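The word-level generative step of this section can be sketched as follows, taking the active concepts (c) and active words per concept (f) for one document as given. The function name, the dictionary-based data layout, and the default `alpha_pi` are assumptions made for this illustration only.

```python
import random


def generate_document(active_concepts, concept_words, theta, num_words, alpha_pi=1.0, seed=0):
    """Sample the words of one document.

    active_concepts: concept ids j with c_j = 1 for this document.
    concept_words:   dict j -> list of word ids i with f_ji = 1 for this document.
    theta:           dict i -> positive word weight theta_i.
    """
    rng = random.Random(seed)
    concepts = sorted(active_concepts)
    # pi_j ~ Gamma(alpha_pi, 1), drawn only for active concepts (inactive ones are masked by c).
    pi = [rng.gammavariate(alpha_pi, 1.0) for _ in concepts]
    words, assignments = [], []
    for _ in range(num_words):
        # z_n: concept assignment with weights proportional to pi * c.
        j = rng.choices(concepts, weights=pi, k=1)[0]
        # w_n: word with weights proportional to theta * f_j.
        candidates = concept_words[j]
        w = rng.choices(candidates, weights=[theta[i] for i in candidates], k=1)[0]
        words.append(w)
        assignments.append(j)
    return words, assignments


if __name__ == "__main__":
    concept_words = {0: [0, 1], 3: [2]}
    theta = {0: 1.0, 1: 2.0, 2: 0.5}
    print(generate_document({0, 3}, concept_words, theta, num_words=10, seed=2))
```

Because words are drawn only from the active set of the assigned concept, a word outside every active concept has zero probability, which is the strict sparsity the model relies on.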

Incorporating semantic knowledge. We assume each word i in our vocabulary is associated with an observed, real-valued feature vector in an F-dimensional semantic space. Our fundamental assumption is that words appearing together in a concept should have similar semantic representations.

Beyond addressing identifiability concerns, explicitly modeling semantic features of our vocabulary allows us to elegantly incorporate a variety of side information to help guide the makeup of concepts. For instance, while the simplicity of the bag-of-words document representation provides many benefits in terms of computational efficiency, much structure is thrown away that could be useful to particular retrieval tasks. As such, if we want concepts to consist of words with related or synonymous meanings, we might consider using sentence co-occurrence counts (e.g., in how many sentences do words $w_1$ and $w_2$ appear together?) as the basis for our semantic features. There is a long history of work in linguistics studying such distributional similarity, popularized by J. R. Firth, who stated that "you shall know a word by the company it keeps" [7]. Rather than ignore this structural information, we can incorporate it as part of our semantic feature set, as we do in the experiments we consider in Sec. 5.

We model this idea by assuming that concepts are associated with latent semantic features in the same F-dimensional space. Words that often co-occur in concept j are expected to have semantic representations that are well-explained by the features for concept j. More formally, we consider an observed random matrix Y of dimensions F × V, where each column $Y_{\cdot i}$ corresponds to the semantic feature vector for word i. Likewise, X is a random matrix with F rows and a countably infinite number of columns, one per concept. As words can simultaneously be active in multiple concepts (e.g., "jaguar" can refer to both a car and an animal), we assume that the expected value of the feature representation for word i, $E[Y_{\cdot i}]$, is equal to a weighted average over $X_{\cdot j}$ for all concepts j, with weights proportional to the number of documents d with $f_{ji}^{(d)} = 1$. In matrix notation, we write $E[Y] = X\Phi$, where the weight matrix $\Phi$ is deterministically computed from all c and f, such that

$$\Phi_{ji} = \frac{\sum_d c_j^{(d)} f_{ji}^{(d)}}{\sum_l \sum_d c_l^{(d)} f_{li}^{(d)}}.$$

By assuming independence of features and Gaussian noise, we can then specify the generative distribution for Y as described in Figure 2. In particular, we place conjugate normal and inverse gamma priors on the latent concept features X and the variance terms $\sigma_k^2$, respectively, allowing us to analytically marginalize them out for inference.

Example: Learning multilingual concepts from images. There are no modeling restrictions on the semantic data other than that we expect the features to be real-valued and with zero mean. Hence, we can take advantage of this flexibility to model semantics in a variety of different forms, not limited to simply textual features. To demonstrate this flexibility, we consider the following toy problem: given a small collection of dessert recipes in English and German, downloaded from food blogs, can we learn concepts that are coherent across the two languages? Based on image-based semantic features, we address this problem without relying on parallel corpora or an explicit dictionary as in the multilingual topic modeling of [H |T3]. Specifically, assuming we have an image associated with each word in our vocabulary (e.g., from a Google images search), our semantic feature model encourages all concepts j that choose to include word i to have latent feature vectors that are similar to the image-based feature vector associated with word i. We hypothesize that, despite a lack of co-occurrence between the two languages within this small corpus, we should still obtain reasonable multilingual concepts, because an apple looks like an apfel, eggs look like eier, and so on.

Specifically, for each of 125 English and German vocabulary words, we first collect the top three search results from Google Images. We transform these images following the approach of Oliva and Torralba [14], to get simple, 10-dimensional GIST-based features for each word, which we use as the semantic features. We then run our sampling procedure from Sec. 4 for 10,000 samples, which we use to infer the marginal probabilities of any two words being active in the same concept.

In order to quantitatively verify our hypothesis, we take this marginal probability matrix and, for each word $w_i$, rank all other words by probability of co-occurrence with $w_i$. We then create a ground truth set by finding all pairs of words in our vocabulary that are considered English-German translations of each other according to Google Translate. We come up with 18 such synonym pairs. We expand each set to include direct synonyms that differ only in case or number, giving us pairs of sets like {egg, eggs} ↔ {eier}. For each such pair, if $w_g$ is a German translation of $w_e$, we see how high $w_e$ is ranked on $w_g$'s marginal probability ranking, and compute the rank accuracy, which is the percentage of the words $w_e$ is ranked higher than. We do this in both directions (from German to English and vice versa) for all 18 set-pairs. For comparison, we also compute rank accuracy when, rather than using our model, we rank words directly based on their L2-distance in the 10-dimensional GIST feature space. Figure 3 shows that our model, combining corpus-based concept modeling with simple image-based semantic features, improves performance over using the image features alone. Of course, as we expect, if we use our model without incorporating the images, we achieve poor performance, as there is little cross-language information in the text.

^It is important to note that in order to incorporate the semantic information into our generative model, we rely on the fact that our concepts are explicitly represented as sparse sets of words. The inclusion or exclusion of a word in a concept directly informs which concepts are responsible for the semantic meaning of that word. To incorporate the same information in traditional topic modeling approaches, we would have all topics contributing to the semantic meaning of every word.

^Images for English words were retrieved from Google.com, and German words from Google.de, to avoid any internal translation that Google might otherwise do.

Figure 3: Rank accuracy of multilingual concepts inferred from English and German recipes; our model, combining image and corpus data, outperforms predictions based on either text or images alone.
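The rank accuracy measure used in this evaluation can be sketched in a few lines. The function name and the tie-handling rule (a tie does not count as being ranked higher) are assumptions made for this example; the text does not specify how ties are treated.

```python
def rank_accuracy(scores, target):
    """Percentage of competing words that `target` outranks.

    scores: dict mapping each candidate word to its co-occurrence probability
            with the query word (the query word itself is excluded).
    Ties are counted as not outranked (an assumption of this sketch).
    """
    others = [w for w in scores if w != target]
    if not others:
        raise ValueError("need at least one competing word")
    beaten = sum(1 for w in others if scores[target] > scores[w])
    return 100.0 * beaten / len(others)


if __name__ == "__main__":
    # Hypothetical co-occurrence probabilities with the German query word "eier".
    scores = {"egg": 0.6, "flour": 0.2, "oven": 0.1, "sugar": 0.7, "salt": 0.05}
    print(rank_accuracy(scores, "egg"))
```

A value of 100 means the translation is the top-ranked word for the query, and a value near 50 is what a random ranking would give on average.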

Anecdotally, we find several concepts that do in fact represent the types of cross-language coherence that we hoped for. For example, one concept consists primarily of kitchen tools and utensils, { whisk, rubber spatula, baking sheet, butter, tortenheber (cake server), schneebesen (whisk), weizenmehl (wheat flour), mehl (flour) }, while another is heavy on mostly white and dry ingredients, { cornstarch, sugar, sugars, zucker (sugar), zuckers (sugars), vanillezucker (vanilla sugar), salz (salt), ricottakuchen (ricotta cake), ricotta, butter, pürierstab, teig (dough), mandel-zitronen-tarte, ofen, springform }. While neither of these concepts is perfect, they are still indicative of the flexibility and power of our semantic representation in guiding concept formation.
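To make the semantic feature model from earlier in this section more concrete, the sketch below builds the weight matrix Φ by counting, for each word, how many documents activate it in each concept, and then forms E[Y] = XΦ. The data layout, the function names, and the exact normalization (each word's column of Φ sums to one) are assumptions based on the "weighted average" description in the text, not a confirmed reproduction of the authors' definition.

```python
def weight_matrix(doc_features, num_concepts, vocab_size):
    """Build Phi, where Phi[j][i] is the share of word i's activations that fall in concept j.

    doc_features: one dict per document, mapping each active concept j (c_j = 1)
                  to the set of word ids i with f_ji = 1.
    """
    counts = [[0] * vocab_size for _ in range(num_concepts)]
    for features in doc_features:
        for j, words in features.items():
            for i in words:
                counts[j][i] += 1
    phi = [[0.0] * vocab_size for _ in range(num_concepts)]
    for i in range(vocab_size):
        total = sum(counts[j][i] for j in range(num_concepts))
        if total > 0:  # words never activated keep an all-zero column
            for j in range(num_concepts):
                phi[j][i] = counts[j][i] / total
    return phi


def expected_features(x, phi):
    """Compute E[Y] = X Phi, with X given as F rows over concepts."""
    num_concepts, vocab_size = len(phi), len(phi[0])
    return [[sum(row[j] * phi[j][i] for j in range(num_concepts)) for i in range(vocab_size)]
            for row in x]


if __name__ == "__main__":
    docs = [{0: {0, 1}}, {0: {0}, 1: {0, 2}}]
    phi = weight_matrix(docs, num_concepts=2, vocab_size=3)
    print(phi)
    print(expected_features([[3.0, 0.0]], phi))
```

A word that is active in only one concept inherits that concept's latent features exactly, while a word shared by several concepts gets a blend weighted by how often each concept uses it.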



4 MCMC Computations

In order to perform inference in our model, we employ a Markov chain Monte Carlo (MCMC) method that interleaves Metropolis-Hastings (MH) and Gibbs sampling updates. Given that we are primarily interested in the concept definitions (f) and their existence and prevalence in each document (c and $\pi$), we marginalize out all other random variables, and end up with a collapsed sampler. This marginalization is analytically possible due to the conjugacy that exists throughout our model. Such collapsing is particularly critical to the performance of our sampler given the number of binary indicator variables in our model. For instance, if we sample $c_j^{(d)}$ without having marginalized out z, we are forced to keep it set to 1 as long as any word n in document d has $z_n^{(d)} = j$.

Another important consideration is that, while our model represents an infinite number of concepts, only a finite number are ever instantiated at any given time. Thus, due to context-specific independence in our graphical model, we find that the variables $f_{ji}^{(d)}$ can be pruned from the model for cases where $c_j^{(d)} = 0$, and sampled only on an as-needed basis. At a high level, our sampling approach is as follows:

Algorithm 1 High-level MCMC inference procedure

1: Sample $c^{(d)} \mid w^{(d)}, \pi^{(d)}, f, c^{(-d)}, Y$ for every document d using an MCMC procedure that proposes births/deaths for unique concepts.

2: Sample $f_{ji}^{(d)} \mid w^{(d)}, \pi^{(d)}, c^{(d)}, f_{-ji}, Y$ for every document d, every concept j present in d, and every word i in the vocabulary.

3: Impute $z^{(d)} \mid w^{(d)}, c^{(d)}, f^{(d)}, \pi^{(d)}$.

4: Use the imputed z variables to aid in sampling $\pi^{(d)} \mid z^{(d)}, c^{(d)}, \alpha_\pi$.

For notational convenience, we write out two likelihood terms for sampling $c^{(d)}$ and $f^{(d)}$. First,

$$P(w^{(d)} \mid c^{(d)}, \pi^{(d)}, f^{(d)}) = \left( \frac{1}{\sum_j \pi_j^{(d)} c_j^{(d)}} \right)^{N_d} \prod_{w \in \text{doc } d} \left( \sum_j \pi_j^{(d)} c_j^{(d)} \frac{\theta_w f_{jw}^{(d)}}{\sum_i \theta_i f_{ji}^{(d)}} \right)^{n_w^{(d)}}, \qquad (4)$$

where $n_w^{(d)}$ counts the occurrences of word w in document d, and $N_d$ is the number of words in document d. Second, by marginalizing X and $\sigma^2$, and recalling that $\Phi$ is determined from c and f,

$$P(Y \mid c, f) \propto \left( \frac{\kappa^{|C|}}{|\Phi_C \Phi_C^T + \kappa I|} \right)^{F/2} \prod_{k=1}^{F} \left( \beta_\sigma + \tfrac{1}{2} Y_{k\cdot} (I - H) Y_{k\cdot}^T \right)^{-(\alpha_\sigma + V/2)}, \qquad (5)$$

where C is the set of active concepts, i.e., $C = \{ j : \sum_i \Phi_{ji} > 0 \}$, $\Phi_C$ contains the rows of $\Phi$ indexed by C, and

$$H = \Phi_C^T (\Phi_C \Phi_C^T + \kappa I)^{-1} \Phi_C. \qquad (6)$$

Figure 4: (a) Histogram illustrating the word-level sparsity our model provides. (b),(c) Word clouds representing the same concept, but instantiated in Democratic and Republican members of Congress, respectively.



The full derivations for the conditional distributions 
below can be found in the supplementary material. 
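As an illustration of the first likelihood term, P(w | c, π, f), the sketch below marginalizes the concept assignment z for each word by mixing over the active concepts of a document. It follows the generative description in Sec. 3; the function name and the dictionary-based inputs are assumptions made for this example.

```python
import math


def log_word_likelihood(word_counts, pi, active_words, theta):
    """Log-likelihood of one document's words with the assignments z summed out.

    word_counts:  dict word id -> count n_w in the document.
    pi:           dict concept id j -> pi_j, listing only concepts with c_j = 1.
    active_words: dict j -> set of word ids with f_jw = 1.
    theta:        dict word id -> positive weight theta_w.
    """
    pi_total = sum(pi.values())
    # Per-concept normalizer: total theta mass of the concept's active words.
    norm = {j: sum(theta[v] for v in active_words[j]) for j in pi}
    log_lik = 0.0
    for w, n_w in word_counts.items():
        mass = sum(pi[j] * theta[w] / norm[j] for j in pi if w in active_words[j])
        if mass == 0.0:
            return float("-inf")  # no active concept can emit this word
        log_lik += n_w * math.log(mass / pi_total)
    return log_lik


if __name__ == "__main__":
    pi = {0: 1.0, 1: 3.0}
    active = {0: {0, 1}, 1: {1}}
    theta = {0: 1.0, 1: 1.0}
    print(log_word_likelihood({0: 1, 1: 2}, pi, active, theta))
```

The negative-infinity case shows why the sampler must keep a concept or word switched on while observed words still depend on it.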

Sampling concept assignments to documents $c^{(d)}$. In the case where we are sampling a concept that is shared among other documents in the corpus (i.e., $c_j^{(e)} = 1$ for some $e \neq d$), we can sample from the conditional distribution:

$$P(c_j^{(d)} \mid w^{(d)}, \pi^{(d)}, f^{(d)}, c_{-j}^{(d)}, c^{(-d)}, Y) \propto P(c_j^{(d)} \mid c_j^{(-d)}) \, P(w^{(d)} \mid c^{(d)}, \pi^{(d)}, f^{(d)}) \, P(Y \mid c, f). \qquad (7)$$

By IBP exchangeability, we assume that the current document is the last one, allowing us to write the prior probability on $c_j^{(d)}$ (the first factor in this expression) as $P(c_j^{(d)} = 1 \mid c_j^{(-d)}) = m_j^{(-d)}/D$, where $m_j^{(-d)}$ is the number of documents $e \neq d$ with $c_j^{(e)} = 1$.

In order to sample unique concepts, we follow Fox et al. and employ a birth/death MH proposal. More details can be found in the supplemental material.
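A single Gibbs update for a shared concept indicator can be sketched as below: the IBP prior m/D is combined with the log-likelihoods of the data under the "on" and "off" settings. The function name and its arguments are hypothetical; in particular, the two log-likelihood values are assumed to be computed elsewhere (for instance as the sum of the two likelihood terms of this section).

```python
import math
import random


def sample_shared_concept(m_other, num_docs, loglik_on, loglik_off, rng):
    """Gibbs step for c_j^(d) when concept j is used by m_other other documents.

    Assumes 1 <= m_other <= num_docs - 1, so the prior lies strictly in (0, 1).
    Returns the sampled indicator and the posterior probability of "on".
    """
    prior_on = m_other / num_docs
    log_on = math.log(prior_on) + loglik_on
    log_off = math.log(1.0 - prior_on) + loglik_off
    # Normalize in log space to avoid underflow with very negative log-likelihoods.
    top = max(log_on, log_off)
    weight_on = math.exp(log_on - top)
    weight_off = math.exp(log_off - top)
    p_on = weight_on / (weight_on + weight_off)
    return (1 if rng.random() < p_on else 0), p_on


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_shared_concept(3, 10, -120.5, -123.0, rng))
```

When the two likelihoods are equal the update simply reproduces the prior, so a concept used by many other documents is more likely to stay on.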

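Sampling concept definitions f. This is similar to sampling a shared concept $c_j^{(d)}$, except that the domain (i.e., the vocabulary) is finite. In particular, we sample from the conditional

$$P(f_{ji}^{(d)} \mid w^{(d)}, \pi^{(d)}, c^{(d)}, f_{-ji}, Y) \propto P(f_{ji}^{(d)} \mid f_{ji}^{(-d)}) \, P(w^{(d)} \mid c^{(d)}, \pi^{(d)}, f^{(d)}) \, P(Y \mid c, f),$$

where, again by exchangeability, $P(f_{ji}^{(d)} = 1 \mid f_{ji}^{(-d)}) = (m_{ji}^{(-d)} + \lambda_i)/m_j$. Here, $m_j$ is the number of documents with $c_j^{(d)} = 1$ and $m_{ji}^{(-d)}$ is the number of documents $e \neq d$ assigning $f_{ji}^{(e)} = 1$.

Imputing $z^{(d)}$ and sampling $\pi^{(d)}$. This is straightforward and details are left for the appendix.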

5 Empirical Results 

In order to evaluate how well our sparsity-inducing 
nested beta process prior gives us the concept co- 
herency and flexibility we desire, we compare concepts 
learned from our model to topics learned from an HDP. 
The HDP is a good model for comparison because it is 
widely used as a probabilistic model for document col- 
lections with an unknown number of topics, but does 
not model sparsity at either the topic or word level.
Our empirical evaluations in this section are all on a 
corpus drawn from the Congressional Record during 
the health care debate that marked the first two years 
of Barack Obama's presidency. Each document here 
represents a member of Congress (either in the Senate 
or the House), and is a compilation of their statements 
from the chamber floor during a few days of debate. 
For analysis, we selected the documents correspond- 
ing to the 100 most active speakers, as measured by 
length of their floor transcripts. After standard stop 
word removal, we extracted named entities and noun 
phrases [19], and removed tokens that did not appear 
in at least 10 of the 100 documents, leaving us with a 
vocabulary size of 682. We augmented this corpus with 
50 semantic features, computed by taking a sentence- 
level word co-occurrence matrix and projecting it to 
the top 50 principal components (which account for 
approximately 60% of the variance). 
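The first half of this feature construction, counting sentence-level co-occurrences, can be sketched with the standard library alone. The function name and the representation of sentences as token lists are assumptions made for this example, and the subsequent projection onto principal components is not shown.

```python
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(sentences, vocabulary):
    """Count, for each unordered word pair, the number of sentences containing both words.

    sentences:  iterable of token lists, one per sentence.
    vocabulary: words to keep; tokens outside it are ignored.
    Keys of the result are alphabetically sorted pairs (a, b) with a < b.
    """
    vocab = set(vocabulary)
    counts = Counter()
    for sentence in sentences:
        # Each word counts once per sentence, however often it repeats.
        present = sorted(set(sentence) & vocab)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    sentences = [["health", "care", "reform"], ["health", "care"], ["reform", "bill", "health"]]
    print(sentence_cooccurrence(sentences, ["health", "care", "reform"]))
```

Filling a symmetric V × V matrix from these counts and centering it would give the input to the principal component projection described above.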

We run two parallel chains of an HDP sampler [20]
for 5,000 iterations, and choose the sample with the
maximum joint likelihood across both chains for our
comparisons. We also run two parallel chains of our
inference procedure from Sec. 4, each to 1,000 samples,
and again select the most probable sample.

As illustrated in Figure [T] and previously discussed 
in the introduction, the concepts our model finds are 
more coherent and focused than the topics learned by 
HDP. The word sparsity of our concepts is evident 
in the histogram of Figure 4(a), showing that 99% of 
the empirical probability mass is on 25 concepts/word 
or fewer. Figure 5 shows three additional concepts 
that are illustrative of the benefits of incorporating our 
sparsity-inducing prior with semantic features. For 
example, the concept found in Figure 5(c) concisely 
represents the so-called donut hole (or "doughnut hole," 

^The recent focused topic model (FTM) of Williamson 
et al. [21] provides concept-level sparsity, but topics are 
still distributions over the entire vocabulary. 

^In the supplemental material, we describe more details 
on how we initialize our sampler, hyperparameter settings, 
and modifications to our sampling procedure necessary for 
running efficiently on larger corpora. 
Figure 5: Word clouds representing three concepts generated from our model, applied to the Congressional data set, 
using sentence co-occurrence-based semantic features. 



as we find both spellings) in Medicare's prescription 
drug benefit, known as Medicare Part D. 

In order to measure this coherency in a more quanti- 
tative manner, we conducted a user study where we 
presented participants with pairs of word clouds cor- 
responding to ten key ideas from the health care de- 
bate (one cloud from our model and the other from 
HDP), and asked them to choose the cloud from each 
pair that provides the more coherent description of the 
word. For fairness, we asked a public policy expert to 
select ten words from our vocabulary representing the 
most salient ideas in the debate, and used these words 
to generate the word clouds. 

For a given word, we found the HDP topic that as- 
signed it the highest probability, as well as the con- 
cept from our model that included it the most times 
across the corpus. The word clouds were generated 
by removing the word in question and then display- 
ing the remaining words proportional to their weight. 
To hide the identity of our approach from the partic- 
ipants, we truncate each HDP topic to have the same 
number of words as the corresponding concept from 
our model. This gives the HDP topics an illusion of 
sparsity that they do not naturally have, and hides 
one of the key advantages we have over topic models. 
For example, the "Obamacare" word clouds shown in 
the top row of Figure 1 are transformed to the ones in 
the bottom row for the purpose of this study. Despite 
handicapping our model in this manner, users found 
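that our model produces more coherent concept 
representations than the HDP. Specifically, ignoring ties, 
73% of the overall preferences expressed by the 34 
participants (16 of 22) favored our concepts over the HDP 
topics, and on average participants chose our model's word 
cloud in 5.74 of the 10 pairs. Moreover, we note 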
that the task itself relied on a subtle understanding of 
American politics. For example, overall, more partici- 
pants preferred the HDP word cloud for "Obamacare" 
than ours, but when only considering participants who 
claimed to have followed the health care debate closely 
(and presumably understand that "Obamacare" is a 
term used pejoratively by Republicans), this preference 
is flipped to ours. More details on the study can 
be found in the supplemental material. 
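
The selection and truncation steps used to build each pair of word clouds can be sketched as below. This is only an illustration under stated assumptions: `topic_word` (topics by words) and `f` (documents by concepts by words, binary) are hypothetical array layouts, and weighting a concept's words by the number of documents in which they are active is an assumption, not something the text specifies.

```python
import numpy as np

def best_hdp_topic(topic_word, word_id):
    """Index of the HDP topic assigning the highest probability to the word."""
    return int(np.argmax(topic_word[:, word_id]))

def best_concept(f, word_id):
    """Index of the concept that includes the word in the most documents."""
    return int(np.argmax(f[:, :, word_id].sum(axis=0)))

def cloud(weights, word_id, size):
    """Ids of the `size` heaviest words, with the query word removed."""
    w = np.asarray(weights, dtype=float).copy()
    w[word_id] = -np.inf
    return [int(i) for i in np.argsort(-w)[:size]]

# Hypothetical toy inputs: 2 HDP topics, 3 documents, 2 concepts, 5 words.
topic_word = np.array([[0.50, 0.20, 0.20, 0.05, 0.05],
                       [0.05, 0.10, 0.15, 0.40, 0.30]])
f = np.zeros((3, 2, 5), dtype=int)
f[0, 0, 3] = 1
f[:, 1, 3] = 1
f[:, 1, 4] = 1
f[0, 1, 2] = 1

query = 3
j = best_concept(f, query)
concept_weights = f[:, j, :].sum(axis=0)
size = int((concept_weights > 0).sum()) - 1       # active words, minus the query
concept_cloud = cloud(concept_weights, query, size)

t = best_hdp_topic(topic_word, query)
hdp_cloud = cloud(topic_word[t], query, size)     # truncated to the same length
```

Truncating the HDP topic to `size` words is what gives it the same apparent sparsity as the concept.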

Finally, since our model allows concepts to have different 
words active in different documents, we can illustrate 
the flexibility of our prior by creating different 



word clouds for the same concept, each representing 
a subset of documents. On this data set, a natural 
separation of the documents is to have two partitions, 
one representing Democrats and the other Republicans. 
Figure 4 shows two word clouds from our 
model representing the same concept, but one comes 
from Democrats and the other from Republicans. We 
can see the qualitative difference in the word weights 
between the two populations, with "discrimination" 
and "women" being more active in this concept for 
Democrats while "employer" and "market" are more 
associated with Republicans. We can find several simi- 
lar anecdotes at the level of individual documents. For 
example, if we look at the leadership of the two parties, 
we find that in one concept, the word "Obamacare" 
is active for Republican Eric Cantor whereas "health 
care bill" is used for Democrat Chris Van Hollen. 

6 Discussion 

Motivated by critical problems in information re- 
trieval, in this paper, we introduce a novel model- 
ing technique for representing concepts in document 
collections. Popular generative models of text have 
tended to focus on representing the ideas in a document 
collection as probability distributions over the 
entire vocabulary, often leading to diffuse, uninfor- 
mative topic descriptions of documents. In contrast, 
our discrete, sparse representation provides a focused, 
concise description that can, for example, enable IR 
systems to more accurately describe a user's prefer- 
ences. A second key contribution in this work is that 
our model naturally incorporates semantic information 
that is not captured by alternative approaches; for ex- 
ample, we demonstrated that images can be used to 
create multilingual concepts, without any other translation 
information. Such side information can often be 
easily obtained, and helps guide topic models towards 
semantically meaningful predictions. 

Despite the promise, our model suffers from common 
limitations faced by many Bayesian nonparametric 
methods. While incorporating semantic features was 
helpful from a modeling perspective, the matrix oper- 
ations required to incorporate this data can be expen- 
sive. In experiments, we utilized parallelization and 
approximation techniques to reduce the running time, 
but scalability remains a key aspect of future work. 

We believe that the notion of superwords, charac- 
terized by sparse concepts and the incorporation of 
side information through semantic features, can sig- 
nificantly improve the effectiveness of IR techniques 
at capturing the nuances of users' preferences. 
A Supplementary Material 

In this appendix, we provide detailed derivations of our 
MCMC sampling procedure, as well as more details on the 
data we use for our experiments. 

A.1 MCMC Derivations 

In our sampling procedure, we assume that $\omega$, $\gamma$ and $z$ 
are marginalized out of our model. However, in order to 
sample $f$ and $\pi$, we impute values for $z$, which we 
discard at every iteration. As described in the main body 
of the paper, we employ a Gibbs sampler that features 
birth/death Metropolis Hastings steps for sampling $c$ and 
a Gibbs-within-Gibbs sampler for jointly sampling $X$ and 
$f$. 

A.1.1 Sample $c^{(d)} \mid w^{(d)}, \pi^{(d)}, f, c^{(-d)}, Y$ 

We assume that, at this moment, we have $J$ total concepts 
active across all documents. $J^{(-d)}$ are active in the other 
documents, while $J_+^{(d)}$ are active only in this document 
(i.e., $J = J^{(-d)} + J_+^{(d)}$). Without loss of generality, we 
assume that the concepts are numbered such that the shared 
concepts $(1, 2, \dots, J^{(-d)})$ come before the unique concepts 
$(J^{(-d)} + 1, \dots, J)$. 

Sampling a shared concept Here, we consider sampling 
$c_j^{(d)} \mid w^{(d)}, \pi^{(d)}, f, c_{-j}^{(d)}, c^{(-d)}, Y$ where 
$j \in \{1, \dots, J^{(-d)}\}$, i.e., concept $j$ is shared. By Bayes' rule, as 
well as conditional independencies in the model, we have 
that: 

$$P(c_j^{(d)} \mid w^{(d)}, \pi^{(d)}, f, c_{-j}^{(d)}, c^{(-d)}, Y) \tag{9}$$
$$\propto P(w^{(d)}, Y \mid c, \pi^{(d)}, f) \cdot P(c_j^{(d)} \mid c^{(-d)})$$
$$\propto P(w^{(d)} \mid c^{(d)}, \pi^{(d)}, f^{(d)}) \cdot P(Y \mid \Phi) \cdot P(c_j^{(d)} \mid c_j^{(-d)}) \tag{10}$$

By IBP exchangeability, we assume that the current document 
is the last one, allowing us to write the prior 
probability on $c_j^{(d)}$ (the last factor in this expression) as 
$P(c_j^{(d)} = 1 \mid c_j^{(-d)}) = m_j^{(-d)} / D$, where $m_j$ is the number of 
documents with $c_j^{(d)} = 1$ and $m_j^{(-d)}$ excludes document $d$ from that count. 
The first factor, which we call the text likelihood term, can 
be simplified as follows: 

$$P(w^{(d)} \mid c^{(d)}, \pi^{(d)}, f^{(d)})$$

where the per-word counts are the number of occurrences of each word $w$ in document $d$, and $N_d$ 
is the total word count for document $d$. Note that if there 
is a word $i$ that appears in document $d$ such that the only 
concept that explains it is $j$ (i.e., $f_{ji}^{(d)} = 1$ but $f_{ki}^{(d)} = 0$ for 
all $k \neq j$), then $c_j^{(d)}$ must be set to 1. 

The second factor, $P(Y \mid \Phi)$, which brings in influence from 
the semantic features, is derived at the end of this supple- 
mentary material. 

We sample from this conditional distribution using a 
Metropolis Hastings step, where we have a deterministic 
proposal that flips the current value of $c_j^{(d)}$ from $c$ to $\bar{c}$. 
Specifically, we flip $c_j^{(d)}$ with the following acceptance 
probability: 

$$\rho(\bar{c} \mid c) = \min\left\{ \frac{P(c_j^{(d)} = \bar{c} \mid w^{(d)}, \pi^{(d)}, f, c_{-j}^{(d)}, c^{(-d)}, Y)}{P(c_j^{(d)} = c \mid w^{(d)}, \pi^{(d)}, f, c_{-j}^{(d)}, c^{(-d)}, Y)},\ 1 \right\} \tag{15}$$

Note that if we consider flipping $c_j^{(d)}$ from 0 to 1, we will 
need to first sample values for $f_j^{(d)}$ and $\pi_j^{(d)}$ from their 
priors, since they wouldn't otherwise exist. We sample 
$\pi_j^{(d)}$ from $\mathrm{Gamma}(\alpha_\pi, 1)$ and sample $f_j^{(d)}$ from its prior in 
Equation (47). 



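Sampling unique concepts Let $c_+^{(d)}$ be the current 
unique concepts for document $d$, and $f_+^{(d)}$ and $\pi_+^{(d)}$ be their 
associated parameters. (Let $f_-^{(d)}$ and $\pi_-^{(d)}$ be the same 
for shared concepts.) 

To sample these unique concepts, we'll use a birth/death 
proposal distribution, which factors as follows (primes denote proposed values): 

$$q(c_+^{(d)\prime}, f_+^{(d)\prime}, \pi_+^{(d)\prime} \mid c_+^{(d)}, f_+^{(d)}, \pi_+^{(d)}) = q_c(c_+^{(d)\prime} \mid c_+^{(d)}) \cdot q_f(f_+^{(d)\prime} \mid f_+^{(d)}, c_+^{(d)\prime}, c_+^{(d)}) \cdot q_\pi(\pi_+^{(d)\prime} \mid \pi_+^{(d)}, c_+^{(d)\prime}, c_+^{(d)}). \tag{16}$$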

In our IBP, the $D$th customer is supposed to sample 
$\mathrm{Poisson}(\alpha_\omega / D)$ new dishes, and therefore the probability 
of a concept birth (which we call $\eta(J_+^{(d)})$) is the probability 
that such a draw would result in more than the current 
number of unique concepts. In particular, we define 
$\eta(J_+^{(d)}) = 1 - \mathrm{PoissonCDF}(J_+^{(d)}; \alpha_\omega / D)$, for $J_+^{(d)} > 0$. If 
there are no unique concepts, then we are forced to propose 
a birth, and as such, $\eta(0) = 1$. 
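
A minimal sketch of this birth probability, using only the standard library, is given below. The function and argument names are chosen for illustration, and the numerical values in the usage lines are arbitrary.

```python
import math

def poisson_pmf(k, rate):
    """Probability that a Poisson(rate) draw equals k."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def poisson_cdf(k, rate):
    """Probability that a Poisson(rate) draw is at most k."""
    return sum(poisson_pmf(i, rate) for i in range(k + 1))

def birth_probability(num_unique, alpha_omega, num_docs):
    """eta(J_+): probability of proposing a concept birth."""
    if num_unique == 0:
        return 1.0                      # a birth is forced when nothing can die
    return 1.0 - poisson_cdf(num_unique, alpha_omega / num_docs)

# Arbitrary illustrative settings: alpha_omega = 2, D = 100 documents.
eta0 = birth_probability(0, 2.0, 100)
eta1 = birth_probability(1, 2.0, 100)
eta2 = birth_probability(2, 2.0, 100)
```

Because the Poisson rate $\alpha_\omega / D$ is small for large corpora, births become rare as soon as a document already has unique concepts.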

Thus, the proposal $q_c(c_+^{(d)\prime} \mid c_+^{(d)})$ adds a new concept 
with probability $\eta(J_+^{(d)})$, and kills off each of the current 
concepts with probability $(1 - \eta(J_+^{(d)})) / J_+^{(d)}$. The proposals 
for $f$ and $\pi$ draw from their priors ($\mathrm{Bernoulli}(\lambda)$ and 
$\mathrm{Gamma}(\alpha_\pi, 1)$, respectively) for a concept birth, and 
otherwise deterministically maintain the current values for the 
existing concepts. 

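Given this definition for our birth/death proposal, we can 
now write down our Metropolis Hastings step for sampling 
the unique $c_+^{(d)}$ and associated parameters. In particular, 
the Metropolis Hastings acceptance ratio is given 
by 

$$\frac{P(c_+^{(d)\prime}, f_+^{(d)\prime}, \pi_+^{(d)\prime} \mid w^{(d)}, c_-^{(d)}, f_-^{(d)}, \pi_-^{(d)}, f^{(-d)}, Y)}{P(c_+^{(d)}, f_+^{(d)}, \pi_+^{(d)} \mid w^{(d)}, c_-^{(d)}, f_-^{(d)}, \pi_-^{(d)}, f^{(-d)}, Y)} \cdot \frac{q(c_+^{(d)}, f_+^{(d)}, \pi_+^{(d)} \mid c_+^{(d)\prime}, f_+^{(d)\prime}, \pi_+^{(d)\prime})}{q(c_+^{(d)\prime}, f_+^{(d)\prime}, \pi_+^{(d)\prime} \mid c_+^{(d)}, f_+^{(d)}, \pi_+^{(d)})} \tag{17}$$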
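Using Bayes' Rule and conditional independencies of our 
model, we can rewrite the first fraction as follows: 

$$\frac{P(w^{(d)} \mid [c_-^{(d)} c_+^{(d)\prime}], [\pi_-^{(d)} \pi_+^{(d)\prime}], [f_-^{(d)} f_+^{(d)\prime}]) \cdot P(Y \mid [c_-^{(d)} c_+^{(d)\prime}], [f_-^{(d)} f_+^{(d)\prime}]) \, P(c_+^{(d)\prime}) \, P(f_+^{(d)\prime}) \, P(\pi_+^{(d)\prime})}{P(w^{(d)} \mid [c_-^{(d)} c_+^{(d)}], [\pi_-^{(d)} \pi_+^{(d)}], [f_-^{(d)} f_+^{(d)}]) \cdot P(Y \mid [c_-^{(d)} c_+^{(d)}], [f_-^{(d)} f_+^{(d)}]) \, P(c_+^{(d)}) \, P(f_+^{(d)}) \, P(\pi_+^{(d)})} \tag{18}$$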



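Likewise, by definition, we can factor the second fraction 
as follows: 

$$\frac{q_c(c_+^{(d)} \mid c_+^{(d)\prime}) \, q_f(f_+^{(d)} \mid f_+^{(d)\prime}, c_+^{(d)}, c_+^{(d)\prime}) \, q_\pi(\pi_+^{(d)} \mid \pi_+^{(d)\prime}, c_+^{(d)}, c_+^{(d)\prime})}{q_c(c_+^{(d)\prime} \mid c_+^{(d)}) \, q_f(f_+^{(d)\prime} \mid f_+^{(d)}, c_+^{(d)\prime}, c_+^{(d)}) \, q_\pi(\pi_+^{(d)\prime} \mid \pi_+^{(d)}, c_+^{(d)\prime}, c_+^{(d)})} \tag{19}$$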

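Plugging in the corresponding terms, if we propose a birth, 
we can simplify the acceptance ratio to: 

$$\frac{P(w^{(d)} \mid [c_-^{(d)} c_+^{(d)\prime}], [\pi_-^{(d)} \pi_+^{(d)\prime}], [f_-^{(d)} f_+^{(d)\prime}])}{P(w^{(d)} \mid [c_-^{(d)} c_+^{(d)}], [\pi_-^{(d)} \pi_+^{(d)}], [f_-^{(d)} f_+^{(d)}])} \cdot \frac{\mathrm{Poisson}(J_+^{(d)} + 1; \alpha_\omega / D) \, (1 - \eta(J_+^{(d)} + 1))}{\mathrm{Poisson}(J_+^{(d)}; \alpha_\omega / D) \, (J_+^{(d)} + 1) \, \eta(J_+^{(d)})} \cdot \frac{P(Y \mid [c_-^{(d)} c_+^{(d)\prime}], [f_-^{(d)} f_+^{(d)\prime}])}{P(Y \mid [c_-^{(d)} c_+^{(d)}], [f_-^{(d)} f_+^{(d)}])} \tag{20}$$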
Likewise, if we propose to kill a concept, we have exactly 
the same form for the acceptance ratio, replacing the Poisson 
factor by: 

$$\frac{\mathrm{Poisson}(J_+^{(d)} - 1; \alpha_\omega / D) \, J_+^{(d)} \, \eta(J_+^{(d)} - 1)}{\mathrm{Poisson}(J_+^{(d)}; \alpha_\omega / D) \, (1 - \eta(J_+^{(d)}))} \tag{21}$$
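
As a sanity check on the Poisson factors of the birth and death acceptance ratios, the sketch below evaluates both and confirms that a birth from $J$ unique concepts and the reverse death from $J + 1$ have reciprocal factors, as a reversible move pair should. The helper names and the rate value are illustrative assumptions, and the likelihood ratios for $w^{(d)}$ and $Y$ are deliberately left out.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def eta(num_unique, rate):
    """Birth probability: 1 - PoissonCDF(J; rate), with eta(0) = 1."""
    if num_unique == 0:
        return 1.0
    cdf = sum(poisson_pmf(i, rate) for i in range(num_unique + 1))
    return 1.0 - cdf

def birth_factor(num_unique, rate):
    """Poisson factor of the acceptance ratio when proposing J -> J + 1."""
    j = num_unique
    num = poisson_pmf(j + 1, rate) * (1.0 - eta(j + 1, rate))
    den = poisson_pmf(j, rate) * (j + 1) * eta(j, rate)
    return num / den

def death_factor(num_unique, rate):
    """Poisson factor of the acceptance ratio when proposing J -> J - 1."""
    j = num_unique
    num = poisson_pmf(j - 1, rate) * j * eta(j - 1, rate)
    den = poisson_pmf(j, rate) * (1.0 - eta(j, rate))
    return num / den

rate = 0.5  # stands in for alpha_omega / D; the value is arbitrary
products = [birth_factor(j, rate) * death_factor(j + 1, rate) for j in range(4)]
```

Each entry of `products` should equal 1 up to floating-point error.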



A.1.2 Impute $z_1^{(d)}, \dots, z_{N_d}^{(d)} \mid w^{(d)}, c^{(d)}, f, \pi^{(d)}$ 

For $n = 1, 2, \dots, N_d$: 
By Bayes' rule: 

$$P(z_n^{(d)} = z \mid w_n^{(d)} = w, c^{(d)}, f, \pi^{(d)}) \propto P(w_n^{(d)} = w \mid z_n^{(d)} = z, f) \, P(z_n^{(d)} = z \mid c^{(d)}, \pi^{(d)}) \tag{22}$$

Note that if $c_z^{(d)} = 0$ or $f_{zw}^{(d)} = 0$, the probability of 
assigning $z_n^{(d)} = z$ is 0. Therefore, when imputing the value 
of $z_n^{(d)}$ (associated with word $w$), we only need to consider 
values $z$ such that $c_z^{(d)} = 1$ and $f_{zw}^{(d)} = 1$. In this case, we 
sample $z$ from: 

(23) 

Note that the numerator of the first factor and the denominator 
of the second factor are the same across all values 
of $z$, so we sample $z_n^{(d)} = z$ (for $z$ such that $c_z^{(d)} = 1$ and 
$f_{zw}^{(d)} = 1$) proportional to: 

(24) 

A.1.3 Sample $\pi^{(d)} \mid z^{(d)}, \alpha_\pi, c^{(d)}$ 

One can think of the concept frequency random variables 
$\pi_j^{(d)}$ as utility random variables aimed at modeling the 
concept frequency distribution 

(25) 

Specifically, only considering the non-zero components of 
the concept frequency distribution (specified by the $c_j^{(d)} = 1$), this distribution is Dirichlet 
distributed and the gamma random variables are used in 
constructing a draw from this distribution. The purpose 
of utilizing the $\pi^{(d)}$ in place of working with the concept frequency distribution directly 
is because the dimensionality of the underlying Dirichlet 
distribution is changing as $c^{(d)}$ changes and thus we can 
maintain an infinite collection of gamma random variables 
that are simply accessed during the sampling procedure. 

We have multinomial observations $z^{(d)}$ (representing the 
word-concept assignments) from the concept frequency 
distribution. Due to the inherent conjugacy of multinomial 
observations to a Dirichlet prior, the posterior of the concept frequency distribution 
is (using a slight abuse of notation): 

$$\mathrm{Dir}(\alpha_\pi + n_1^{(d)}, \alpha_\pi + n_2^{(d)}, \dots; c^{(d)})$$

Once again, we can work in terms of our utility random 
variables $\pi^{(d)}$ to form a draw from the desired Dirichlet 
posterior (again, along the non-zero components). Specifically, 
we draw 

$$\pi_j^{(d)} \mid z^{(d)} \sim \mathrm{Gamma}(\alpha_\pi + n_j^{(d)}, 1). \tag{27}$$

If $c_j^{(d)} = 0$, necessarily we will not have any counts of 
concept $j$ in document $d$ (i.e., $n_j^{(d)} = 0$), implying that the 
distribution of these utility random variables remains the 
same. Thus, in practice we only need to resample the 
utility random variables $\pi_j^{(d)}$ for which $c_j^{(d)} = 1$. 

A.1.4 Sample $f_{ji}^{(d)} \mid w^{(d)}, f_{-ji}^{(d)}, f^{(-d)}, z^{(d)}, c$ 

First, we note that $f_{ji}^{(d)}$ only exists in documents $d$ where 
$c_j^{(d)} = 1$. (If $c_j^{(d)} = 0$, it means that all observations $w^{(d)}$ 
are independent of $f_j^{(d)}$, and thus such $f$ nodes can be 
pruned.) Second, while, due to conjugacy, we can write 
the conditional distribution for $f_{ji}^{(d)}$ without using $z^{(d)}$ (as 
was the case when sampling $c$), we use the imputed $z^{(d)}$ 
here for computational efficiency. 

We have the following: 

$$P(f_{ji}^{(d)} \mid w^{(d)}, f_{-ji}^{(d)}, f^{(-d)}, z^{(d)}, c) \tag{28}$$

(31) 

The first factor in this expression is the corpus likelihood 
term, assuming $z^{(d)}$. We can simplify this as follows: 

(32) 
(34) 
(35) 

Recall that the second factor is the semantic likelihood 
term, and is derived at the end of this supplementary 
material. 

Finally, the integral, representing the prior probability on 
$f_{ji}$, is simplified as follows: 

$$\frac{B(m_{ji} + \lambda_i,\ m_j - m_{ji} + 1 - \lambda_i)}{B(\lambda_i,\ 1 - \lambda_i)}$$

where $m_{ji}$ is the number of documents with both $c_j^{(d)} = 1$ 
and $f_{ji}^{(d)} = 1$. Thus, if $f_{ji}^{(d)} = 1$, we have: 

$$\Gamma(m_{ji}^{(-d)} + \lambda_i + 1) \, \Gamma(m_j - m_{ji}^{(-d)} - \lambda_i) \tag{43}$$
$$= (m_{ji}^{(-d)} + \lambda_i) \, \Gamma(m_{ji}^{(-d)} + \lambda_i) \, \Gamma(m_j - m_{ji}^{(-d)} - \lambda_i), \tag{44}$$

and if $f_{ji}^{(d)} = 0$, we have: 

$$\Gamma(m_{ji}^{(-d)} + \lambda_i) \, \Gamma(m_j - m_{ji}^{(-d)} + 1 - \lambda_i) \tag{45}$$
$$= \Gamma(m_{ji}^{(-d)} + \lambda_i) \, (m_j - m_{ji}^{(-d)} - \lambda_i) \, \Gamma(m_j - m_{ji}^{(-d)} - \lambda_i). \tag{46}$$

After some cancellation, we get the following: 

$$P(f_{ji}^{(d)} = 1 \mid f_{ji}^{(-d)}) = \frac{m_{ji}^{(-d)} + \lambda_i}{m_j}, \qquad P(f_{ji}^{(d)} = 0 \mid f_{ji}^{(-d)}) = \frac{m_j - m_{ji}^{(-d)} - \lambda_i}{m_j}. \tag{47}$$

To recap, we sample $f_{ji}^{(d)}$ as follows: 

(48) 

1. if $n_j^{(d)} = 0$, we sample $f_{ji}^{(d)}$ proportional to 
$P(Y \mid c, f) \, P(f_{ji}^{(d)} \mid f_{ji}^{(-d)})$. 

2. if $n_j^{(d)} > 0$ but $n_{ji}^{(d)} = 0$, we sample proportional to 
$P(Y \mid c, f) \, P(f_{ji}^{(d)} \mid f_{ji}^{(-d)})$ multiplied by the corpus likelihood term above. 

3. else, if $n_j^{(d)} > 0$ and $n_{ji}^{(d)} > 0$, set $f_{ji}^{(d)} = 1$ with 
probability 1. 
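
The cancellation behind the prior on $f_{ji}^{(d)}$ can be checked numerically. The sketch below normalizes the two Beta-function expressions (for $f_{ji}^{(d)} = 1$ and $f_{ji}^{(d)} = 0$) and compares the result with the closed form $(m_{ji}^{(-d)} + \lambda_i) / m_j$. The function names and the specific counts are illustrative assumptions.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prior_prob_f_on(m_ji_other, m_j, lam):
    """P(f = 1 | f in other documents), by normalizing the two cases.

    m_ji_other: documents other than d with c_j = 1 and f_ji = 1
    m_j:        documents (including d) with c_j = 1
    lam:        lambda_i in (0, 1)
    The common factor 1 / B(lam, 1 - lam) cancels in the normalization.
    """
    log_on = log_beta(m_ji_other + 1 + lam, m_j - (m_ji_other + 1) + 1 - lam)
    log_off = log_beta(m_ji_other + lam, m_j - m_ji_other + 1 - lam)
    on, off = math.exp(log_on), math.exp(log_off)
    return on / (on + off)

# Arbitrary illustrative counts.
p_numeric = prior_prob_f_on(m_ji_other=3, m_j=7, lam=0.2)
p_closed_form = (3 + 0.2) / 7
```

The two values agree, which is the content of the cancellation step.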



A.1.5 Determine $p(Y \mid \Phi)$ 

Recall that $F$ denotes the dimensionality of our semantic 
features. If the number of concepts were finite with $J$ 
concepts, we could specify 

$$X \mid \Sigma \sim \mathcal{MN}(M, \Sigma, K) \tag{49}$$
$$\Sigma \sim \mathrm{IW}(n_0, S_0), \tag{50}$$

where $\mathcal{MN}$ denotes a matrix normal distribution and $\mathrm{IW}$ 
an inverse Wishart. Here, $M$ defines the mean matrix 
while $\Sigma$ and $K$ define the left and right covariances of 
dimensions $F \times F$ and $J \times J$, respectively. Typically, $K$ is 
assumed to be diagonal with $K = \mathrm{diag}(\kappa_1, \dots, \kappa_J)$. 

Using the fact that our features are independently Gaussian 
distributed, we can write 

$$Y \mid X, \Sigma, \Phi \sim \mathcal{MN}(X\Phi, \Sigma, I_J) \tag{51}$$

The prior for $X, \Sigma$ above is conjugate to this likelihood, 
so we can analytically compute the marginal likelihood. 
Standard matrix normal inverse Wishart conjugacy results 
yield 

$$P(Y \mid \Phi) \tag{52}$$

Here, 

$$S_{y|\bar{y}} = S_{yy} - S_{y\bar{y}} S_{\bar{y}\bar{y}}^{-1} S_{y\bar{y}}' \tag{53}$$
$$S_{\bar{y}\bar{y}} = \Phi\Phi' + K \tag{54}$$
$$S_{y\bar{y}} = Y\Phi' + MK \tag{55}$$
$$S_{yy} = YY' + MKM'. \tag{56}$$

In the case of an unbounded number of concepts, we 
restrict our attention to noise covariances $\Sigma$ that are diagonal 
($\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_F^2)$). This implies 

(57) 

$$X_{kj} \mid \sigma_k^2, \kappa_j \sim N(M_{kj}, \sigma_k^2 / \kappa_j) \tag{58}$$

independently for all $k$. 

Although we are not working with a finite model, the $\Phi$ 
matrix implicitly truncates our model. Let $X_c$ represent 
the set of latent concept features associated with the 
instantiated concepts and $\Phi_c$ the non-zero columns of $\Phi$ 
associated with the active concepts. Then, our model above 
is equivalent to 

$$Y \mid X_c, \Phi_c \sim \mathcal{MN}(X_c \Phi_c, \Sigma, I_c). \tag{59}$$

Using the marginal likelihood formula of Eq. (52), and 
simplifying based on the factorization of the likelihood across 
the dimensions of our feature space, yields 

$$p(Y \mid \Phi) \tag{60}$$

A.2 Synthetic Example 

The phenomenon of weak identifiability is illustrated in 
Figure 6. We generate synthetic data for 100 documents, 
assuming the underlying concept representation depicted 
in Figure 6(a). We then run our Gibbs sampler (cf. Sec. 4) 
for 1,000 samples in order to infer the concept definitions 
for each document. Figure 6(b) shows the average concept 
definitions $f^{(d)}$ across all documents, for the final sample. 
We see that while most of the concepts are correctly recovered, 
the sampler has difficulty reconstructing the concept 
representing words 11 to 15. However, by incorporating 
additional semantic information about our vocabulary in the 
manner described below, we are able to properly recover 
all five concepts, as seen in Figure 6(c). 

[Figure 6 panels: (a) Ground truth, (b) No semantic features, (c) With semantic features; horizontal axis: words] 

Figure 6: Synthetic example illustrating weak identifiability of the nested beta process and how we overcome it using 
semantic features. For each active concept, the average $f^{(d)}$ vector over all documents $d$ is plotted, with white indicating 
presence of a word in a concept, black indicating absence, and gray for average values of $f^{(d)}$ in between 0 and 1. 



A.3 User study results 

We filtered user study participants to make sure they had 
followed the health care debate at least at the level of 
reading the headlines. Of these 34 participants, we first 
measured how many of the ten questions resulted in a 
favorable vote for our model as compared to HDP, and found 
it to be 5.74 on average: 

Num: 34 

Average: 5.7352941176471 
Standard Dev: 1.5398400132598 
T-test 95% conf. interval: 

[5.2176965657521, 6.252891669542] 

We then asked how many participants preferred our wordles 
to HDP, treating each participant as a Bernoulli sample, 
and ignoring the ties: 

Num wins: 16 
Num losses: 6 

Binomial mean: 0.72727272727273 
Binomial std: 0.44536177141512 
Binomial Sign Test 95% conf. interval: 
[0.54116788781464, 0.91337756673082] 
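
The reported intervals can be recomputed from the summary statistics above. In the sketch below, the use of a normal approximation with $z = 1.96$ is an observation from recomputing the figures (both intervals match it to the printed precision), not something stated in the text; the function name is illustrative.

```python
import math

def normal_ci(mean, std, n, z=1.96):
    """Two-sided interval mean +/- z * std / sqrt(n)."""
    half_width = z * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# Per-participant number of questions (out of 10) favoring the model.
votes_lo, votes_hi = normal_ci(5.7352941176471, 1.5398400132598, 34)

# Per-participant win/loss outcome, ties excluded.
wins, losses = 16, 6
p = wins / (wins + losses)
p_std = math.sqrt(p * (1.0 - p))
pref_lo, pref_hi = normal_ci(p, p_std, wins + losses)
```

Here `p` is the 73% figure and `p_std` reproduces the reported binomial standard deviation.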



A.4 Sampler details 

We initialize $c$ and $f$ in our sampler from a simple k-means 
clustering of the words using the semantic features. It is 
interesting to see how, after many samples, a concept's 
definition changes from the initialization. Figure 7 shows 
one particular concept, and how our model reweights the 
words and adds/removes words from the initial cluster. 
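
One way such an initialization could look is sketched below. The text only says that $c$ and $f$ are initialized from a k-means clustering of the words; treating each cluster as a concept, and switching a word on in a document only if it occurs there, are assumptions made for this example, as are all function and variable names.

```python
import numpy as np

def kmeans(features, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each row."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = features[labels == j].mean(axis=0)
    return labels

def init_concepts(labels, doc_word_counts, k):
    """Binary c (docs x concepts) and f (docs x concepts x words)."""
    present = doc_word_counts > 0                  # (D, V): word occurs in doc
    member = np.eye(k, dtype=bool)[labels].T       # (k, V): word in cluster
    f = present[:, None, :] & member[None, :, :]   # (D, k, V)
    c = f.any(axis=2)                              # concept on if any word is on
    return c.astype(int), f.astype(int)

# Hypothetical toy data: 4 words with 2-d semantic features, 2 documents.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
doc_word_counts = np.array([[1, 0, 2, 0],
                            [0, 0, 0, 3]])
labels = kmeans(features, k=2)
c, f = init_concepts(labels, doc_word_counts, k=2)
```

This construction guarantees that every observed word is explained by at least one active concept at initialization.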

We ran our sampler with the following hyperparameter set- 
tings: 
[Figure 7 panels: (a) K-means initialization, (b) MAP sample] 

Figure 7: Concept changes from initialization 
In this user study, you will be presented with 10 pairs of word clouds describing 
various topics from the U.S. health care reform debate of 2009-2010, and will be 
asked a few questions. This study should take 5-10 minutes. 

Do you follow US politics? 

• Yes 
• No 

How much did you pay attention to the US health care debate of 2009-2010? 

• I followed it closely. 
• I paid attention to the main headlines. 
• Very little. 

(Each question below displays a pair of word clouds, each with a "this one" selection button.) 

1. Which of the two word clouds more coherently describes the concept COSTS? 

2. Which of the two word clouds more coherently describes the concept REPEAL? 

3. Which of the two word clouds more coherently describes the concept PREMIUMS? 

4. Which of the two word clouds more coherently describes the concept MANDATES? 

5. Which of the two word clouds more coherently describes the concept INSURANCE? 

6. Which of the two word clouds more coherently describes the concept MEDICARE? 

7. Which of the two word clouds more coherently describes the concept STATES? 

8. Which of the two word clouds more coherently describes the concept EMPLOYERS? 

9. Which of the two word clouds more coherently describes the concept OBAMACARE? 

10. Which of the two word clouds more coherently describes the concept COVERAGE? 

Make sure you've made a selection for each one! 

Submit 